System and Method for Whiteboard and Audio Capture

ABSTRACT

A system that captures both the whiteboard content and the audio signals of a meeting using a digital camera and a microphone. The system can be retrofitted to any existing whiteboard. It computes the time stamps of pen strokes on the whiteboard by analyzing the sequence of captured snapshots. It also automatically produces a set of key frames representing all the written content on the whiteboard before each erasure. The whiteboard content serves as a visual index for efficiently browsing the recorded audio of the meeting. The system not only captures the whiteboard content, but also helps the users to view and manage the captured meeting content efficiently and securely.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of a prior application entitled “SYSTEM AND METHOD FOR WHITEBOARD AND AUDIO CAPTURE”, which was assigned Ser. No. 10/178,443 and filed Jun. 19, 2002.

BACKGROUND

1. Technical Field

This invention is directed toward a system and method for recording meetings. More particularly, this invention is directed toward a system and method for capturing both the whiteboard content and audio of a meeting.

2. Background Art

Meetings constitute a large part of many workers' working time. Making more efficient use of this time spent in meetings translates into a big increase in productivity.

Many meeting scenarios use a whiteboard extensively for brainstorming sessions, lectures, project planning meetings, patent disclosures, and so on. Note-taking and copying what is written on the board often interferes with many participants' active contribution and involvement during these meetings. As a result, some efforts have been undertaken to capture whiteboard content in some automated fashion.

Several technologies have been developed to capture the whiteboard content automatically. One of the earliest, the whiteboard copier, is a special whiteboard with a built-in copier. With a click of a button, the whiteboard content is scanned and printed. Once the whiteboard content is on paper, it can be photocopied, faxed, put away in the file cabinet, or scanned into digital form.

More recent technologies attempt to capture the whiteboard content in digital form from the start. They generally fall into two categories—those that capture images of the whiteboard and those that track pen location and infer whiteboard content therefrom.

The devices in the first category capture images of the whiteboard directly. National Television System Committee (NTSC)-resolution video cameras are often used because of their low cost. Since these cameras usually do not have enough resolution to clearly capture what is written on a typical conference room size whiteboard, several video frames must be stitched together to create a single whiteboard image. Another device in this first category is the digital still camera. As high-resolution digital cameras get cheaper, taking snapshots of the board with a digital camera becomes a popular choice.

Devices in the second category track the location of the pen used to write on the whiteboard at high frequency and infer the content of the whiteboard from the history of the pen coordinates. Sometimes they include an add-on device attached to the side of a conventional whiteboard and use special cases for the dry-ink pens and eraser. Each pen emits ultrasonic pulses when pressed against the board. Two receivers at the add-on device use the difference in time-of-arrival of the audio pulses to triangulate the pen coordinates. Since the history of the pen coordinates is captured, the content on the whiteboard at any given moment can be reconstructed later. The user of this type of whiteboard recording can play back the whiteboard content like a movie. Because the content is captured in vector form, it can be transmitted and archived with low bandwidth and small storage requirements.

Electronic whiteboards also use pen-tracking technology. They go one step further than the systems using the previously discussed add-on devices by making the whiteboard an interactive device. The user writes on a monitor with a special stylus that is tracked by the computer. The computer renders the strokes on the screen wherever the stylus touches the screen—as if the ink is deposited by the stylus. Because the strokes are computer generated, they can be edited, re-flowed, and animated. The user can also issue gesture commands to the computer and show other computer applications on the same screen.

Electronic whiteboards, however, currently still have a limited installation base due to their high cost and small sizes (the size of an electronic whiteboard rarely exceeds 6 feet in diagonal). Furthermore, systems with pen-tracking devices have the following disadvantages: 1) If the system is not on, or the user writes without using the special pens, the content cannot be recovered by the device; 2) Many people like to use their fingers, instead of the special eraser, to correct small mistakes on the whiteboard, and this common behavior causes extra strokes to appear on the captured content; 3) For some of the devices, people have to use special dry-ink pen adapters, which make the pens much thicker and harder to press; and 4) Imprecision of pen tracking sometimes causes mis-registration of adjacent pen strokes.

Besides the work discussed above with respect to whiteboard capture methods, a great amount of research has been done on the capture, integration, and access of the multimedia experience, especially with respect to lectures and meetings. People have developed techniques and systems that use handwritten notes, whiteboard content, slides, or manual annotations to index the recorded video and audio for easy access.

For example, in a project called the Classroom2000 project, Abowd et al. used an electronic whiteboard to time-stamp the ink strokes so that the viewers (students) could use the ink strokes as indexes into the recorded video and audio. Key frames (called pages) were computed based on the erasure events provided by the electronic whiteboard. The Classroom2000 project, however, required an electronic whiteboard. With an electronic whiteboard, there are many disadvantages from the end user's point of view. First of all, most offices and meeting rooms do not have electronic whiteboards installed. Secondly, it has been shown that people find it much more natural to use a regular whiteboard than an electronic whiteboard. Thirdly, images captured with a camera provide much more contextual information, such as who was writing and which topic was being discussed (usually indicated by hand pointing). In addition to these disadvantages, electronic whiteboards can be costly and are thus not readily available.

SUMMARY

The present invention is directed toward a system and process that overcomes the aforementioned limitations in systems for capturing whiteboard content and associated audio.

The Whiteboard Capture System differs from the above systems that capture images of the whiteboard directly in that it computes the time stamps of pen strokes and key frames by performing analysis on the captured images. Key frame images contain all of the important content on the whiteboard and serve as a summary of the recording. They can be cut and pasted into other documents or printed as notes. The time stamps and key frames are effective indices to the recorded audio. Additionally, unlike some other whiteboard capture systems, the Whiteboard Capture System employs an ordinary whiteboard, not an electronic whiteboard. Thus, the system can be used with any existing whiteboard without modification.

The Whiteboard Capture System captures a sequence of images of content written on a non-electronic whiteboard with a camera. It simultaneously records the audio signals of the meeting. Once the recording is complete, the image sequence is analyzed to isolate the key frames that summarize the key points of the contents written on the whiteboard. The audio recordings are correlated to the pen strokes on the key frames by time stamps which are associated with both the recorded audio and the image sequence. These time stamps are computed through image analysis.

The general analysis process for obtaining key frames involves rectifying the whiteboard view in every image in the sequence of images. The whiteboard background color is also extracted, and each image of the sequence of images is divided into cells. Cell images that are the same over time are clustered together, as will be explained in more detail later. Each cell image is then classified as a stroke, a foreground object, or whiteboard. Key frame images are then extracted using the classification results. Cell images can be spatially and temporally filtered to refine classification results prior to key frame extraction. Additionally, the key frame images, once extracted, can be color balanced to improve image quality.

More specifically, rectifying the whiteboard view involves cropping any non-whiteboard region of each image. The four corners of the whiteboard are then specified in each image. A bi-linear warp is then performed for each image using bi-cubic interpolation to obtain a cropped and rectified whiteboard image in each captured image.

Two methods may be used for extracting the whiteboard background color. The first method involves determining the whiteboard cells with the brightest luminance and smallest variance. The color with the brightest luminance and the smallest variance is designated as the whiteboard background color. Once the whiteboard background color is thus determined, any holes in the whiteboard color are found and filled by searching the whiteboard cells around each hole. Each hole's color is then set to the color of the nearest cell that is not a hole.

The second method for extracting the whiteboard background color involves histogramming the whiteboard image luminance and determining the peak whiteboard luminance. The color corresponding to the peak luminance is designated as the initial whiteboard color. Any whiteboard color outliers (erroneous data) are then determined using a least-median-squares technique. These outliers are marked as holes and are filled in the same manner as in the first method of determining whiteboard color discussed above. The whiteboard color image can be filtered after filling each hole.

The process of dividing each image in the input sequence into cells improves the analysis processing speed. Typically, each image is divided into cells such that the cell size is approximately the size of a single character on the board. This is equivalent to 1.5 inches by 1.5 inches, or 25 pixels by 25 pixels, for a typical conference room size whiteboard. Alternately, however, all analysis can be performed on a per-pixel basis.

Once the sequence of input images is rectified and the whiteboard color has been determined, the cell images are clustered. Cell images that are considered to be the same over time are clustered together in groups. A normalized cross-correlation technique and a Mahalanobis distance test are used to determine if two cells are the same.

The cell-classifying process determines whether a cell image is a whiteboard cell, a stroke, or a foreground object. A cell image is designated as a whiteboard cell if the red, green, and blue (RGB) values are approximately the same. Alternately, a cell image is designated as a stroke cell if the cell is mostly white or gray with one or two primary colors mixed in. Otherwise, the cell image is designated as a foreground cell. The cell-classifying process determines the color distribution of the current cell image and the color distribution of the corresponding whiteboard cell. The cells are then classified based on whether the color distribution of the current cell image and the corresponding whiteboard cell are the same, not the same but with a strong similarity, or totally different.

The above classification procedure only uses the color information in a single cell. More accurate results can be achieved by utilizing the spatial and temporal relationships among cell groups. In spatial filtering, two operations are performed on every whiteboard image. First, isolated foreground cells are identified and reclassified as strokes. Second, stroke cells which are immediately connected to some foreground cells are reclassified as foreground cells. With respect to temporal filtering, the basic observation is that it is virtually impossible to write the same stroke in exactly the same position after it is erased. In other words, if for any given cell the cell images of two different frames contain the same stroke, then all the cell images in between the two frames must have the same stroke unless there is a foreground object blocking the cell. At the temporal filtering step, such a cell will be classified as a stroke as long as it is exposed to the camera before and after the foreground object blocks it.

The key frames can then be extracted. To this end, the classification results are used and the stroke cells are counted for each image or frame in the sequence of images. The peaks and valleys of the stroke count are determined. If the difference between an adjacent peak and valley of the stroke count exceeds a prescribed threshold, the data between the valleys are designated as chapters (each chapter signifying a different topic) and the peak within each chapter as the key frame representing the chapter.

The key frame images are then reconstructed. This involves inputting the classified cell images and the key frames divided into cell images. If a key frame cell image is classified as a whiteboard image or a stroke image, its image is rendered as a whiteboard image or a stroke image, respectively. Alternately, if a key frame foreground cell image is within the span of a stroke, this cell image is rendered with the stroke cell image from neighboring images in the sequence. If the key frame cell image is not classified as a whiteboard image, a stroke image, or a foreground cell within the span of a stroke, it is rendered as a whiteboard image.

Color balancing can then be used to improve the image quality of the key frame images by making the background uniformly white and increasing the color saturation of the pen strokes, using the mean whiteboard color to scale the color of each pixel in a cell. Image noise is also reduced.

After the analysis server processes the image sequence and produces the index and key frame images, it sends emails to the registered session participants with the Uniform Resource Locator (URL) (the “address” or location of a Web site or other Internet service) of the processed recording. The users can click on the URL to launch the browsing software. The browser allows users to view the key frame images and quickly access the audio associated with a particular topic.

The User Interface (UI) of the browsing software has various components. The primary elements of the browser UI include a key frame pane where key frame thumbnails are displayed, and the main display pane of the browser that shows a composition of the raw image from the camera and the current key frame image.

The key frame pane also incorporates a background transparency slider that allows the user to adjust the image displayed in the main display pane from the raw input image to the key frame image. Current pen strokes, strokes that have already been written in the meeting playback timeline, are rendered darker and more clearly than future strokes. The pen strokes that the participants are going to write in the future are shown in a ghost-like style. This visualization technique is realized using the following process. The current whiteboard content is rendered using the key frame image of the current chapter and the time stamp information. Then the future strokes are rendered, converted to gray scale, and blurred using a Gaussian filter. These two images are then added, and the resultant image is alpha-blended with the rectified image from the input sequence. The user can control the alpha value with the GUI slider, from 0, showing only the rendered key frame whiteboard image, to 1, showing exactly the original image.
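By way of illustration, the following Python sketch (using OpenCV and NumPy, neither of which is specified by the invention) shows one way the ghosting and alpha-blending described above could be implemented. All function and argument names are hypothetical; `current_strokes` and `future_strokes` stand for renderings of the chapter key frame split by stroke time stamps.

```python
import cv2


def render_playback_view(current_strokes, future_strokes, raw_frame, alpha):
    """Compose the main-pane image for one playback instant.

    `current_strokes` and `future_strokes` are BGR renderings of the key
    frame's strokes split by time stamp; `raw_frame` is the rectified camera
    image; `alpha` comes from the background transparency slider (0..1).
    """
    # Future strokes are shown ghost-like: converted to grayscale, blurred.
    ghost = cv2.cvtColor(future_strokes, cv2.COLOR_BGR2GRAY)
    ghost = cv2.GaussianBlur(ghost, (7, 7), 0)
    ghost = cv2.cvtColor(ghost, cv2.COLOR_GRAY2BGR)
    # Add the current-stroke rendering and the ghosted future strokes.
    rendered = cv2.add(current_strokes, ghost)
    # alpha = 0 shows only the rendered whiteboard, 1 the original image.
    return cv2.addWeighted(rendered, 1.0 - alpha, raw_frame, alpha, 0)
```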

A VCR and standard timeline control is provided in the lower left corner of the browser UI, below the main display pane. The VCR and standard timeline control allows the user to sequence backwards or forwards, slowly or quickly, in the image/audio sequence, or to stop, much like the controls found on a typical video cassette recorder (VCR). A timeline bar graphically displays the length of the audio/image sequence as a bar, and provides numerical values of the start time, end time, and current time of the meeting playback. A pointer on this bar can be selected and dragged forward and backward along the timeline bar to linearly sequence forwards and backwards in the image/audio sequence.

Two levels of non-linear access to the recorded audio are provided in the context of visual indexing. The first level of non-linear access is through the use of key frame thumbnails. The user can click a key frame thumbnail to jump to the starting point of the audio (e.g., the beginning of the chapter) for the corresponding key frame. Each key frame has a time range associated with it that assists the user in determining the time range associated with that particular key frame. The second level of non-linear access to the recorded audio is through the use of the pen strokes in each key frame. When the cursor is hovering over a pen stroke cell (current stroke cell or future stroke cell) in the main window, the cursor is changed to a “hand” symbol indicating that it is selectable (e.g., “clickable” with a mouse). Double clicking on the cell with a mouse or other input device brings the application to the audio playback mode. The playback starts from the time of the session when the clicked stroke cell was written. The user can still click other stroke cells to jump to other parts of the session. Together with the VCR and standard timeline control, these two levels of visual indexing allow the user to browse a meeting in a very efficient way.

As stated previously, the thumbnails of the key frame images are listed in the key frame pane. Selecting one of the thumbnails brings the corresponding key frame image to the main window at the left and takes the application to the image viewing mode, where the user can zoom in and out using the zoom control buttons, read the text and diagrams in the image, or cut and paste a portion of the image to other documents. Additionally, the entire key frame can be cut and pasted to other documents or printed as notes.

In the Whiteboard Capture System, meeting participants are asked to register with the capture software at the beginning of the meeting recording. All the recorded sessions reside on a web server. If no one registers, the meeting is posted on a publicly accessible web page. If at least one participant registers, an access token is generated after the meeting recording and analysis. The token is a long, randomly generated string containing a unique meeting identifier. The URL containing the token is emailed to the registered participants. The recipients go to the URL to launch the web browsing software to review the meeting. They can also forward the URL to people who have not attended the meeting.
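The text does not specify how the token is generated beyond it being a long, randomly generated string containing a unique meeting identifier. The snippet below is one hedged illustration using Python's standard `secrets` module; the token format and the example URL are assumptions.

```python
import secrets


def make_access_token(meeting_id: str) -> str:
    """Illustrative token scheme: a unique meeting identifier combined with
    a long, unguessable random string, as the description requires."""
    return f"{meeting_id}-{secrets.token_urlsafe(32)}"


# The token would then be embedded in the emailed URL, e.g. (hypothetical
# host and path):
#   https://example.com/meetings/view?token=<token>
```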

The above-described basic Whiteboard Capture System can be combined with many other techniques and devices to render alternate embodiments. In one such embodiment, conventional Optical Character Recognition (OCR) is performed on the key frames to provide editable text that is easily used to create documents or presentation viewgraphs.

In another embodiment, conventional voice recognition software is used to convert the audio portion of the captured data to text. This allows the easy creation of meeting minutes and other documents. It also provides a relatively inexpensive way to provide meeting information to the hearing impaired.

The Whiteboard Capture System can also be made portable using, for example, a notebook computer with a microphone and a camera mounted on a tripod. This configuration only requires an additional initial calibration to determine the location of the camera relative to the whiteboard. This calibration could be performed manually, by identifying the four corners of the whiteboard in the image, or automatically, by using such methods as edge detection.

The analysis software of the Whiteboard Capture System can also be used to determine key frames with whiteboard capture systems that use pen tracking to infer whiteboard content. Using the Whiteboard Capture System analysis software with such a system simplifies the analysis process. There is no determination of whiteboard background color or whiteboard region rectification necessary, no spatial and temporal filtering required, and the classification of whiteboard cells is simpler because cell images will either be stroke or whiteboard, since no foreground object will interfere with the text written on the whiteboard.

Additionally, to achieve a higher frame rate, a high-resolution video camera such as an HDTV camera can be used instead of a still camera.

In yet another embodiment, the Whiteboard Capture System incorporates gesture recognition to use gesture commands. For instance, a command box can be written somewhere on the whiteboard. When the user motions or points to the box, the system uses gesture recognition to time stamp the images at the particular time the gesture was made.

The whiteboard capture system relieves meeting participants of the mundane note-taking task, so they can focus on contributing and absorbing ideas during meetings. By providing key frame images that summarize the whiteboard content and structured visual indexing to the audio, the system helps the participants to review the meeting at a later time. Furthermore, people who did not attend the meeting can often understand the gist of the meeting in a fraction of the time.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a diagram depicting a general purpose computing device constituting an exemplary system for implementing the invention.

FIG. 2 is a diagram depicting the three main components of the whiteboard capture system—the capture unit, the analysis server, and the browsing software. This diagram was captured using a prototype whiteboard capture system.

FIG. 3 is a schematic of a whiteboard capture system according to the present invention.

FIG. 4 is a series of images showing selected frames from an input image sequence.

FIG. 5 is a flow chart depicting the image analysis process of the system and method according to the invention.

FIG. 6A is a flow chart depicting a first technique for computing whiteboard color.

FIG. 6B is a flow chart depicting a second technique for computing whiteboard color.

FIG. 7 is a series of images showing whiteboard color extraction results. The left image is the result of the first strategy of computing whiteboard color, the middle image is the result of the second strategy of computing whiteboard color, and the right image shows the actual blank whiteboard image.

FIG. 8 is a flow chart depicting the cell classification process of the system and method according to the present invention.

FIG. 9 is a series of samples of the classification results. The images shown correspond to the images in FIG. 4 after cropping and rectification.

FIG. 10 is a plot of the number of strokes vs. time for the sequence in FIG. 4.

FIG. 11 is a flow chart depicting the general process used to select key frames from a sequence of input images.

FIG. 12 is a flow chart depicting the process of identifying chapters and key frames in the system and method according to the present invention.

FIG. 13 is a flow chart depicting the process of reconstructing key frame images in the system and method according to the present invention.

FIG. 14 is a flow chart depicting the process of color balancing the key frame images in the system and method according to the present invention.

FIG. 15 is an image depicting the browser interface of the whiteboard capture system. Each key frame image represents the whiteboard content of a key moment in the recording.

FIG. 16 is a flow chart depicting the process of displaying current and future pen strokes in the system and method according to the present invention.

FIG. 17 is a flow chart depicting the security processing used in the system and method according to the present invention.

FIG. 18A provides sample images of whiteboard content taken at three installation sites of a working embodiment of the invention.

FIGS. 18B, 18C and 18D are a series of figures depicting the input (FIG. 18B) and output key frame images (FIGS. 18C and 18D) of a working embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

1.0 Exemplary Operating Environment

FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195. Of particular significance to the present invention, a camera 192 (such as a digital/electronic still or video camera, or film/photographic scanner) capable of capturing a sequence of images 193 can also be included as an input device to the personal computer 110. Further, while just one camera is depicted, multiple cameras could be included as an input device to the personal computer 110. The images 193 from the one or more cameras are input into the computer 110 via an appropriate camera interface 194. This interface 194 is connected to the system bus 121, thereby allowing the images to be routed to and stored in the RAM 132, or one of the other data storage devices associated with the computer 110. However, it is noted that image data can be input into the computer 110 from any of the aforementioned computer-readable media as well, without requiring the use of the camera 192.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

The exemplary operating environment having now been discussed, the remaining parts of this description section will be devoted to a description of the program modules embodying the invention.

2.0 Whiteboard Capture System and Method.

2.1 System Architecture.

Conceptually, the Whiteboard Capture System consists of three primary components: a capture unit 202, an analysis/processing server 204, and browsing software 206, as shown in FIG. 2.

1. Capture Unit:

The capture unit is used to capture images of the whiteboard content and to record the audio associated with the creation of the whiteboard content. The capture unit is installed in a room where meetings take place. As shown in FIG. 3, it includes a digital camera 302, a microphone 304, and a personal computer (PC) 306. The capture unit takes images of the whiteboard 308 and records audio via the microphone 304 that is stored to the PC 306. Both the images taken and the audio are time stamped. The images and the audio samples are obtained against a common clock, usually the system clock. The timing of the common clock is associated with the images and audio samples and is stored as their time stamps.
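As a rough illustration of the common-clock time stamping (an assumed implementation, not taken from the description), both streams can be stamped against the same monotonic system clock so they can later be correlated:

```python
import time


def timestamp_sample(sample, session_start):
    """Attach a common-clock time stamp to a captured image or audio
    buffer; images and audio share the same clock, so a stroke's first
    appearance can later be matched to a point in the audio."""
    return {"data": sample, "t": time.monotonic() - session_start}
```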

2. Analysis Server:

The analysis server 204 is located in a central place and analyzes and stores the recorded image data. In one embodiment, an analysis program is launched automatically after the user stops the recording in the capture unit. After processing the recorded data, emails containing the URL where the meeting recording is available are sent to the registered participants. If there are no registered users, the meeting recording can be posted to a publicly accessible web site.

3. Browsing Software:

The browsing software 206 allows the user to view and play back the recorded and analyzed meeting data. The browsing software 206 is preferably provided as a web plug-in to be installed by the users who wish to view the meeting recordings. Once installed, the users can click the aforementioned URL to launch the software to access the data on the analysis server.

2.2 Image Acquisition

The input to the Whiteboard Capture System is a set of still digital images. FIG. 4 shows an exemplary set of such images. The image sequence is analyzed to determine when and where the users wrote on the board and to distill a set of key frame images that summarize the whiteboard content throughout a session.

Any relatively high resolution camera that allows camera control by a computer can be used for image acquisition. The camera is preferably mounted at either the side or the back of a meeting room. The camera is zoomed in as close to the whiteboard as possible to maximize the effective resolution. The camera is stationary after the installation, and the assumption is made that the whiteboard does not move, so the whiteboard images are stationary throughout the captured sequence.

If the camera used has only an auto-focus mode, the whiteboard might become out of focus if an object in front of the whiteboard triggers the attention of the auto-focus mechanism of the camera. This problem can be mitigated by aligning the image plane of the camera as parallel to the whiteboard as possible to minimize scene depth, and/or by minimizing the aperture to increase the depth of field. In practice, only 1-2% of the frames were observed to be out of focus in a working embodiment of the Whiteboard Capture System.

The camera takes the pictures as fast as it can and transfers the images to the PC, preferably via a USB connection. One JPEG image was obtained about every 5 seconds in a working embodiment of the Whiteboard Capture System. The exposure and white-balance parameters are typically kept constant. Assuming the light setting does not change within one session, the color of the whiteboard background should stay constant in a sequence.

It was found that slightly underexposed images give better color saturation, which makes the stroke extraction process, to be discussed later, more accurate. A color-balancing step can be performed after recording to make the grayish whiteboard images more appealing.

2.3 Image Sequence Analysis

Since a person who is writing on the board is in the line of sight between the digital camera and the whiteboard, he/she often obscures some part of the whiteboard and casts shadows on other parts. It is necessary to distinguish among the strokes, the foreground object (e.g., the person writing on the board), and the whiteboard. Once the classification results are known, the key frame images and an index can be used by the browsing software.

Rather than analyzing the images on a per-pixel level (although this could be done), the whiteboard region is divided into rectangular cells to lower the computational cost. The cell size is chosen to be roughly the same as the expected size of a single character on the board (about 1.5 by 1.5 inches, or 25 by 25 pixels, in a working embodiment). Since the cell grid divides each frame in the input sequence into cell images, the input can be thought of as a three-dimensional matrix of cell images (e.g., x, y, time). The division of each image into cells is typically performed after the input images have been rectified.
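A minimal sketch of this cell division, assuming a rectified NumPy image array and the 25-pixel cell size quoted above (edge handling is an assumption; the description does not say how partial cells are treated):

```python
import numpy as np


def divide_into_cells(image, cell=25):
    """Split a rectified whiteboard image into a grid of cell images
    (~25x25 pixels, roughly one written character). Pixels beyond the
    last full cell are simply dropped in this sketch."""
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    return [[image[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
             for c in range(cols)] for r in range(rows)]
```

Stacking these grids over all frames yields the three-dimensional (x, y, time) matrix of cell images described above.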

As shown in FIG. 5 and below, the general process actions of the procedure to analyze the input image sequence are as follows:

1. Rectify the whiteboard region of every image in the sequence (process action 502).
2. Extract the whiteboard background color (process action 504).
3. Cluster the cell images throughout the sequence for the same cell, after dividing every image in the sequence into corresponding cell images (process action 506). If two cell images are considered to be the same over time, they are clustered in the same group.
4. Classify each cell image as a stroke, a foreground object, or the whiteboard (process action 508).
5. Filter the cell images both spatially and temporally to refine the classification results (process action 510).
6. Extract the key frame images using the classification results (process action 512).
7. Color-balance the key frame images (process action 514).

In the following paragraphs, the running example shown in FIG. 4 is used to illustrate the input image sequence analysis procedure.

2.3.1 Rectify the Whiteboard Images

Before feeding the image sequence to the stroke extraction process, the non-whiteboard region is cropped and the images are rectified. Because the lens of the camera used in the working embodiment has fairly low radial distortion, it is only necessary to identify the four corners of the whiteboard (otherwise it might be necessary to correct for radial distortion via conventional methods prior to rectifying the images). This is done manually by clicking on the location of the four corners of the whiteboard in a captured image during a one-time calibration step, although it could be done automatically (e.g., by using edge detection). With the four corners, a simple conventional bi-linear warp is performed for each image in the sequence using bi-cubic interpolation to obtain a cropped and rectified whiteboard view in each captured image.
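The following sketch shows one plausible implementation of this cropping-and-rectification step. It uses OpenCV's perspective warp with bicubic sampling as a stand-in for the bi-linear warp with bi-cubic interpolation described above; the corner list and the output size are assumed inputs from the one-time calibration.

```python
import cv2
import numpy as np


def rectify_whiteboard(image, corners, out_w=1024, out_h=768):
    """Crop and rectify the whiteboard region of one captured frame.

    `corners` holds the four whiteboard corners from the one-time manual
    calibration, ordered top-left, top-right, bottom-right, bottom-left.
    """
    src = np.float32(corners)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(src, dst)
    # INTER_CUBIC stands in for the bi-cubic interpolation described above.
    return cv2.warpPerspective(image, H, (out_w, out_h),
                               flags=cv2.INTER_CUBIC)
```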

2.3.2 Computing the Whiteboard Color

For the classification of the cells, it is necessary to know what the whiteboard color is (that is, the color of the whiteboard itself without anything written on it) for each cell. The whiteboard color is also used for white-balancing in producing the key frames, so it should be estimated accurately to ensure the high quality of the key frame images.

Two strategies have been used for computing the whiteboard color. The first strategy, outlined in FIG. 6A, is based on the assumption that the whiteboard cells have the brightest luminance over time and have small variance (i.e., are almost uniform within each cell). This is reasonable since the color of the strokes (red, green, blue or black) will lower the luminance. As shown in process action 602, the whiteboard cells with the brightest luminance and smallest variance are computed. This, however, may produce holes in the final whiteboard color image. For example, if a cell either contains a stroke or is blocked by a foreground object throughout the sequence, the whiteboard color computed for this cell will not be correct (this cell appears different from the rest of the whiteboard, and thus looks like a hole). To this end, as shown in process action 604, any holes in the whiteboard color image are detected by using a technique called least-median-squares (similar to the outlier detection method described in the next paragraph). The holes are then filled (process action 606). To fill a hole, its neighborhood is searched, and the whiteboard color is set to that of the nearest cell that is not a hole. This strategy usually works quite well, but it fails when a person wears a white T-shirt and/or holds a piece of white paper. The left image of FIG. 7 shows the whiteboard color image computed from the input sequence in FIG. 4, where a person was holding a white paper in some of the frames. It can be seen that the computed whiteboard color is corrupted by the white paper.
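A hedged sketch of this first strategy follows. The description does not say how brightness and variance are traded off when picking the best frame for a cell; combining them into a single score, as below, is an assumption, as is the Rec. 601 luminance weighting.

```python
import numpy as np


def whiteboard_color_strategy1(cells):
    """Estimate per-cell whiteboard color from a (time, rows, cols, h, w, 3)
    stack of cell images: for each cell, pick the frame with the brightest,
    least-varying luminance and use that frame's mean color."""
    T, R, C = cells.shape[:3]
    wb = np.zeros((R, C, 3))
    for r in range(R):
        for c in range(C):
            # Per-frame luminance of this cell (Rec. 601 weights, assumed).
            lum = cells[:, r, c] @ np.array([0.299, 0.587, 0.114])
            # Favor bright, uniform frames (one possible scoring rule).
            score = lum.mean(axis=(1, 2)) - lum.var(axis=(1, 2))
            best = int(np.argmax(score))
            wb[r, c] = cells[best, r, c].reshape(-1, 3).mean(axis=0)
    return wb
```

Cells that never show the bare board would still come out wrong (the holes described above) and would be detected and filled afterwards.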

The second strategy of determining whiteboard color is shown in FIG. 6B and is more sophisticated. The assumption is that a significant portion of the pixels in each cell over time belongs to the whiteboard. By building a histogram of the luminance for each cell, the color corresponding to the peak with a high luminance value is very likely the color of the whiteboard for this cell. The first step is therefore to build a histogram for each cell and compute the peak luminance (process actions 610 through 614), computing an initial whiteboard color in this way. This technique works even if a cell contains a stroke throughout the sequence, but it fails in the case when a person wears a white T-shirt and/or holds a piece of white paper, or when a cell is always hidden by people or other objects. In such cases, the computed whiteboard color image contains outliers. The next action is to detect any outliers (process action 616). The outlier detection is based on a robust technique called least-median-squares. Assuming the color varies smoothly across the whiteboard, a plane is fit in the luminance Y or RGB space by minimizing the median of the squared errors. The cells whose color does not follow this model are considered to be outliers and consequently rejected, i.e., they are marked as holes (process action 618). The interested reader is referred to the Appendix for the details of this technique. Next, the holes are filled by using the same procedure as in the first whiteboard color computing strategy (process action 620). Finally, to further improve the result, the whiteboard color image may be filtered by locally fitting a plane in the RGB space (process action 622). The interested reader is again referred to the Appendix for details. The result obtained with this new technique on the same example is shown in the middle image of FIG. 7. Clear improvements are seen over the result obtained with the first strategy, shown at the left. The actual blank whiteboard is also shown in the right image for comparison.
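The least-median-squares plane fit at the heart of the outlier detection could look roughly like the following; the trial count and the residual threshold are assumptions, not values from the description (which defers the details to the Appendix).

```python
import numpy as np


def lmeds_plane(points, n_trials=500, rng=np.random.default_rng(0)):
    """Least-median-of-squares fit of a plane Y ~ a*x + b*y + c over
    (x, y, luminance) cell samples; returns the plane parameters and a
    boolean outlier mask. Outlier cells are treated as holes."""
    xy1 = np.c_[points[:, :2], np.ones(len(points))]
    Y = points[:, 2]
    best, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(points), 3, replace=False)
        try:
            abc = np.linalg.solve(xy1[idx], Y[idx])  # plane through 3 cells
        except np.linalg.LinAlgError:
            continue  # degenerate sample; draw again
        med = np.median((xy1 @ abc - Y) ** 2)
        if med < best_med:
            best, best_med = abc, med
    resid = (xy1 @ best - Y) ** 2
    # Threshold relative to the median residual (an assumed cutoff).
    outliers = resid > 2.5 ** 2 * best_med
    return best, outliers
```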

2.3.3 Clustering Cell Images over Time

During the meeting, the content of each cell usually changes over time. For each cell, one would like to cluster all the cell images in the time sequence into groups, where each group contains the cell images that are considered to be the same over time. A modified Normalized Cross-Correlation algorithm is used to determine if two cell images are the same or not. In the following, the Normalized Cross-Correlation technique is described using one color component of the image, but it applies to all RGB components.

Consider two cell images I and I′. Let $\bar{I}$ and $\bar{I}'$ be their mean colors and σ and σ′ be their standard deviations. The normalized cross-correlation score is given by

$c = \frac{1}{N\,\sigma\,\sigma'} \sum_{i} \left( I_{i} - \bar{I} \right)\left( I'_{i} - \bar{I}' \right)$

where the summation is over every pixel i and N is the total number of pixels. The score ranges from −1, for two images not similar at all, to 1, for two identical images. Since this score is computed after the subtraction of the mean color, it may still give a high value even if two images have very different mean colors. So an additional test is used on the mean color difference, based on the Mahalanobis distance, which is given by $d = |\bar{I} - \bar{I}'|/(\sigma + \sigma')$. In summary, two cell images I and I′ are considered to be identical, and thus should be put into the same group, if and only if $d < T_d$ and $c > T_c$. In a working implementation of the Whiteboard Capture System, $T_d = 2$ and $T_c = 0.707$ were successfully used.
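Putting the two tests together, a cell-image equality check might be sketched as follows. Applying the tests per color component and requiring all three to pass is an assumption; the text above works through a single component as an example.

```python
import numpy as np


def same_cell(a, b, t_d=2.0, t_c=0.707):
    """Decide whether two cell images contain the same content, using the
    normalized cross-correlation test c > t_c and the Mahalanobis
    mean-color test d < t_d, per color component."""
    for ch in range(3):  # R, G, B
        x = a[..., ch].ravel().astype(float)
        y = b[..., ch].ravel().astype(float)
        sx, sy = x.std() + 1e-6, y.std() + 1e-6  # guard flat cells
        c = np.mean((x - x.mean()) * (y - y.mean())) / (sx * sy)
        d = abs(x.mean() - y.mean()) / (sx + sy)
        if not (c > t_c and d < t_d):
            return False
    return True
```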

2.3.4 Classifying Cells

The cell-classifying process action determines whether a cell image is a whiteboard, a stroke, or a foreground object. The following heuristics are used: 1) a whiteboard cell is uniform in color and is grey or white (i.e., the RGB values are approximately the same); 2) a stroke cell is mostly white or grey with one or two primary colors mixed in; 3) a foreground object does not have the characteristics above. The classification therefore determines whether the color distribution of the current cell image and the whiteboard color distribution are the same, or not the same but with strong overlap, or totally different. Again, the Mahalanobis distance is used, as described below.

Notice that the whiteboard color has already been computed as described previously. Again, one color component of RGB is used as an example. Let $\bar{I}_w$ be the whiteboard color and $\sigma_w$ be the standard deviation (it is a small value since a whiteboard cell is approximately uniform). Let $\bar{I}$ and σ be the mean and standard deviation of the current cell image. The cell image is classified as a whiteboard cell if and only if $|\bar{I} - \bar{I}_w|/(\sigma + \sigma_w) < T_w$ and $\sigma/\sigma_w < T_\sigma$; as a stroke cell if and only if $|\bar{I} - \bar{I}_w|/(\sigma + \sigma_w) < T_w$ and $\sigma/\sigma_w \geq T_\sigma$; otherwise, it is classified as a foreground object cell. In a working embodiment of the Whiteboard Capture System, $T_w = 2$ and $T_\sigma = 2$ were successfully used.
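A sketch of this classification rule follows. How the per-component decisions are combined across R, G, and B is an assumption, since the text works through a single color component; here a cell must pass the mean test in every component, and any component with a large variance ratio makes the cell a stroke.

```python
def classify_cell(cell, wb_color, wb_sigma, t_w=2.0, t_sigma=2.0):
    """Classify one cell image as 'whiteboard', 'stroke', or 'foreground'.

    `wb_color` and `wb_sigma` are the per-channel whiteboard color mean
    and standard deviation estimated for this cell.
    """
    import numpy as np

    labels = []
    for ch in range(3):
        x = cell[..., ch].ravel().astype(float)
        mean_ok = abs(x.mean() - wb_color[ch]) / (x.std() + wb_sigma[ch]) < t_w
        if not mean_ok:
            return 'foreground'        # color distribution totally different
        if x.std() / wb_sigma[ch] < t_sigma:
            labels.append('whiteboard')  # same distribution
        else:
            labels.append('stroke')      # strong overlap, extra variance
    return 'stroke' if 'stroke' in labels else 'whiteboard'
```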

2.3.5 Filtering Cell Classification

The above classification procedure only uses the color information in a single cell. More accurate results can be achieved by utilizing the spatial and temporal relationships among the cell groups.

2.3.5.1 Spatial Filtering.

With respect to spatial filtering, the basic observation is that foreground cells should not appear isolated spatially, since a person usually blocks a continuous region of the whiteboard. In spatial filtering, two operations are performed on every single whiteboard image, as shown in FIG. 8. First, isolated foreground cells are identified and reclassified as strokes (process action 802). Second, stroke cells which are immediately connected to some foreground cells are reclassified as foreground cells (process action 804). One main purpose of the second operation is to handle the cells at the boundaries of the foreground object. If such a cell contains strokes, the second operation will incorrectly classify this cell as a foreground object. Fortunately, however, the following temporal filtering corrects such potential errors.
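One straightforward rendering of these two operations on a grid of cell labels is sketched below; the use of an 8-connected neighborhood is an assumption, as the description does not define connectivity.

```python
def spatial_filter(grid):
    """Apply the two spatial-filtering operations to one frame's label
    grid, a 2-D list of 'whiteboard' / 'stroke' / 'foreground' labels."""
    R, C = len(grid), len(grid[0])

    def neighbors(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr or dc) and 0 <= r + dr < R and 0 <= c + dc < C:
                    yield grid[r + dr][c + dc]

    out = [row[:] for row in grid]
    for r in range(R):
        for c in range(C):
            nbrs = list(neighbors(r, c))
            # 1. Isolated foreground cells are reclassified as strokes.
            if grid[r][c] == 'foreground' and 'foreground' not in nbrs:
                out[r][c] = 'stroke'
            # 2. Stroke cells touching foreground become foreground.
            elif grid[r][c] == 'stroke' and 'foreground' in nbrs:
                out[r][c] = 'foreground'
    return out
```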

2.3.5.2 Temporal Filtering.

With respect to temporal filtering, the basic observation is that it is virtually impossible to write the same stroke in exactly the same position after it is erased. In other words, if for any given cell the cell images of two different frames contain the same stroke, then all the cell images in between the two frames must have the same stroke unless there is a foreground object blocking the cell. This observation is very useful for segmenting out the foreground objects. Consider the example in the previous section, where a stroke cell at the boundary of the foreground object is incorrectly classified as a foreground cell. At the temporal filtering step, this cell will be classified as a stroke as long as it is exposed to the camera before and after the foreground object blocks it.
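A simple sketch of this temporal rule for one cell follows, where `same(i, j)` stands for the clustering test of Section 2.3.3; the quadratic scan over stroke frames is an illustrative simplification.

```python
def temporal_filter(labels, same):
    """Temporal filtering for one cell across the frame sequence.

    `labels` is the per-frame classification of the cell and `same(i, j)`
    reports whether the cell images at frames i and j show the same
    content. Any frame lying between two matching stroke frames is
    relabeled as a stroke (the cell was merely blocked in between)."""
    stroke_frames = [i for i, lab in enumerate(labels) if lab == 'stroke']
    out = labels[:]
    for a in stroke_frames:
        for b in stroke_frames:
            if b > a and same(a, b):
                for i in range(a + 1, b):
                    out[i] = 'stroke'  # exposed before and after the block
    return out
```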

FIG. 9 shows the classification results for the sample images in FIG. 4, where the strokes are in green, the foreground is in black, and the whiteboard is in white.

2.3.6 Key Frame Image Extraction

Key frame images contain all the important content on the whiteboard and serve as a summary of the recording. The user should expect the key frame images to have the following properties: 1) They should capture all the important content on the board; 2) The number of key frames should be kept to a minimum; 3) They should only contain the pen strokes and the whiteboard, but not the person in front; 4) They should have a uniform white background and saturated pen colors for easy cut-and-paste and printing.

The key frame extraction procedure uses the cell image classification results from the process actions previously described. The procedure first decides which frames in the sequence should be selected as key frames; it then reconstructs the key frame images. This is described in detail below.

2.3.6.1 Key Frame Selection.

There is no unique solution in selecting the key frames—just as there is no single way to summarize a meeting. In the most general sense, referring to FIG. 11, the input image cells that have been classified as stroke, foreground, or whiteboard are used (process action 1102). The meeting is first divided into several chapters (topics) (process action 1104). An erasure of a significant portion of the board content usually indicates a change of topic, so it is used as a divider of the chapters. Then a key frame image representative of the whiteboard content is created for that chapter (process action 1106). The frame just before a significant erasure starts is chosen as the key frame, which ensures that the content is preserved in these frames. The detailed procedure, shown in FIG. 12, works as follows:

1. The number of stroke cells for each frame in the sequence is counted (process action 1202). One stroke cell image may span multiple frames—it is included in the count for each of those frames. FIG. 10 shows the stroke cell count plotted against frame number in the example session (FIG. 4). A rise in the plot indicates that more strokes are written on the board, while a dip in the plot indicates that some strokes are erased. The graph is quite noisy. There are two reasons: 1) The user is constantly making small adjustments on the board; 2) The classification results contain small errors.
2. Using the stroke count for the various frames, the peaks and valleys are determined (process action 1204). If a key frame were produced at each dip, dozens of key frames would result. In order to keep the number of key frames to a minimum, the data is filtered to retain only the significant erasure events. The procedure ignores the fluctuation in the data unless the difference between the adjacent peak and valley exceeds a certain threshold (process action 1206). Twenty percent of the maximum stroke count was successfully used in a working embodiment of the system.
3. The valleys in the data are then used to divide the session into chapters (process action 1208). The frame containing the peak within a chapter is chosen to be the key frame representing the chapter. A sketch of this selection logic is given after this list.
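One plausible reading of this procedure in code form is sketched below; the exact peak/valley bookkeeping is an assumption, since FIG. 12 is not reproduced here.

```python
import numpy as np


def select_key_frames(stroke_counts, rel_threshold=0.2):
    """Pick key frames from per-frame stroke counts.

    A chapter ends when the count drops from its running peak by more than
    `rel_threshold` of the maximum stroke count (the 20% figure quoted
    above); the peak frame of each chapter is its key frame."""
    counts = np.asarray(stroke_counts, dtype=float)
    min_drop = rel_threshold * counts.max()
    key_frames, peak = [], 0
    for i in range(1, len(counts)):
        if counts[i] > counts[peak]:
            peak = i
        if counts[peak] - counts[i] > min_drop:
            key_frames.append(peak)  # significant erasure ends the chapter
            peak = i                 # the valley starts a new chapter
    key_frames.append(peak)          # key frame of the final chapter
    return key_frames
```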

2.3.6.2 Image Reconstruction.

Once the frames are selected, it is necessary to reconstruct the images corresponding to what the whiteboard looked like at these points in time. However, one cannot simply use the raw images from the input sequence because they may contain foreground objects. The image is reconstructed by gathering the cell images in the frame. Referring to FIG. 13, the frames divided into cell images and the key frame divided into cell images are input (process action 1302). There are three cases, depending on the cell classification (a sketch follows the list):

1. If a key frame cell image is whiteboard or stroke, its own image is used (process actions 1304, 1306).
2. If the key frame foreground cell image is within the span of a stroke (i.e., the person is obscuring the strokes on the board, which is determined through temporal filtering during the analysis phase), this cell image is replaced with the stroke cell image from the neighboring frames (process actions 1308, 1310).
3. Otherwise, as shown in process actions 1312 and 1314, a foreground object must be covering the whiteboard background in this cell, and it is filled in with the whiteboard color computed as discussed previously.
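An illustrative reconstruction loop over the three cases might look like the following; all container and helper names (`cell_labels`, `cell_images`, `stroke_span`, `wb_colors`) are hypothetical stand-ins for the analysis results described above.

```python
def reconstruct_key_frame(key_idx, cell_labels, cell_images, wb_colors,
                          stroke_span):
    """Rebuild one key frame cell by cell.

    `cell_labels[t][r][c]` and `cell_images[t][r][c]` hold per-frame
    results, `stroke_span(r, c, t)` returns a neighboring frame whose cell
    shows the stroke hidden at frame t (or None), and `wb_colors[r][c]` is
    the estimated whiteboard color for the cell."""
    R, C = len(cell_labels[key_idx]), len(cell_labels[key_idx][0])
    frame = [[None] * C for _ in range(R)]
    for r in range(R):
        for c in range(C):
            label = cell_labels[key_idx][r][c]
            if label in ('whiteboard', 'stroke'):
                frame[r][c] = cell_images[key_idx][r][c]   # case 1
            else:
                t = stroke_span(r, c, key_idx)
                if t is not None:                           # case 2
                    frame[r][c] = cell_images[t][r][c]
                else:                                       # case 3
                    frame[r][c] = wb_colors[r][c]
    return frame
```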

2.3.7 Key Frame Color Balance

The reconstruction process removes the person from the whiteboard images, but the images still look like the raw images from the input sequence: grayish and noisy. They can be color balanced to produce a better image. The process consists of two steps:

1. Make the background uniformly white and increase the color saturation of the pen strokes. For each cell, the whiteboard color computed as discussed previously, $\bar{I}_w$, is used to scale the color of each pixel in the cell: $I_{out} = \min\left(255, \frac{I_{in}}{\bar{I}_w} \cdot 255\right)$ (process action 1402).
2. Reduce image noise. The value of each color channel of each pixel in the key frames is remapped according to an S-shaped curve (process action 1404). Intensities less than 255/2 are scaled down toward 0, while intensities larger than 255/2 are scaled up toward 255.
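A sketch of both steps for one cell follows. The specific S-shaped curve used here (smoothstep) is an assumed choice; the text only constrains its behavior around 255/2.

```python
import numpy as np


def color_balance(cell, wb_color):
    """White-balance one cell against its whiteboard color, then apply an
    S-curve to push the background toward white and ink toward saturation."""
    # Step 1: I_out = min(255, I_in / I_w * 255), per pixel and channel.
    scaled = np.minimum(255.0, cell.astype(float) / wb_color * 255.0)
    # Step 2: S-shaped remap; smoothstep darkens values below 255/2 and
    # brightens values above it (one plausible curve, not from the text).
    x = scaled / 255.0
    s_curve = x * x * (3.0 - 2.0 * x)
    return (255.0 * s_curve).astype(np.uint8)
```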

The beginning and ending times of the chapters and the file names of
their key frame images are saved in the index along with the time
stamps of the strokes. The time stamp of a stroke is the first frame
in which that stroke appears. This information was computed as
described in Section 2.3.3.

2.4 Browser Operation and User Interface

2.4.1 Overview.

After the analysis server processes the image sequence and produces
the index and key frame images, it sends emails to the registered
session participants with the URL of the processed recording. The
users can click the URL to launch the browsing software. The goal of
the browsing software is to allow users to view the key frame images
and quickly access the audio associated with a particular topic.

The User Interface (UI) of the browsing software is shown in FIG. 15.
The primary areas of the UI include a key frame pane 1504, where key
frame thumbnails 1502 (graphical representations of the key frame
images) are displayed, and the main display pane 1506 of the browser,
which shows a composition of the raw image 1512 from the camera and
the current key frame image 1502. The key frame pane 1504 also
incorporates a background transparency slider 1516 that allows the
user to adjust the image displayed in the main display pane 1506 from
the raw input image to the key frame image. Current pen strokes 1510,
strokes that have already been written in the meeting playback
timeline, are rendered darker and more clearly in the main display
pane than future strokes 1508, strokes that have not yet been written
in the meeting playback timeline. The pen strokes that the
participants are going to write in the future 1508 are shown in a
ghost-like style. This visualization technique will be described in
more detail later.

A VCR and standard timeline control 1514 is provided in the lower left
corner of the browser UI, below the main display pane 1506. The VCR
and standard timeline control 1514 allows the user to sequence
backwards or forwards slowly or quickly through the image/audio
sequence, or to stop, much like the controls found on a typical VCR. A
timeline bar 1518 graphically displays the length of the audio/image
sequence as a bar, and provides numerical values of the start time,
end time and current time of the meeting playback. A pointer 1520 on
this bar 1518 can be selected and dragged forward and backward to
linearly sequence forwards and backwards through the image/audio
sequence.

It should be noted that even though the locations of some of the
aforementioned UI elements are given, this is not meant to be
limiting. These UI elements could be rendered in any location on the
display, either alone or in combination with other elements.

2.4.2 Non-Linear Access to Meeting Data

Two levels of non-linear access to the recorded audio were provided in
the context of visual indexing.

The first level of non-linear access is through the use of the key
frame thumbnails 1502. Each key frame thumbnail has a time range
associated with it on the display. The user can click a key frame
thumbnail to jump to the starting point of the audio (e.g., the
beginning of the chapter) for the corresponding key frame.

The second level of access to the recorded audio is through the use of
the pen strokes in each key frame. When the cursor hovers over a pen
stroke cell (current stroke 1510 or future stroke 1508) in the main
window 1506, the cursor changes to a "hand" symbol indicating that the
cell is selectable (e.g., "clickable" with a mouse). Double clicking
on the cell with a mouse or other input device brings the application
into the audio playback mode. Playback starts from the point in the
session when the clicked stroke cell was written; that time is the
earliest time at which a cell image of the same pattern appeared in
the sequence. The main window starts to show the image at that time.
The user can still click other stroke cells to jump to another part of
the session.
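By way of illustration, a minimal sketch of this lookup, assuming the
clustering step recorded, for each cell, the first frame at which each
cell-image pattern appeared (the data structure and names below are
assumptions):

```python
# cell_history[(r, c)] is an assumed list of (first_frame, cluster_id)
# pairs produced by the clustering step for cell (r, c). Clicking a
# stroke cell jumps the audio to the first frame whose cluster matches
# the clicked pattern.

def stroke_time(cell_history, r, c, clicked_cluster_id, frame_interval_s):
    for first_frame, cluster_id in cell_history[(r, c)]:
        if cluster_id == clicked_cluster_id:
            return first_frame * frame_interval_s  # seconds into the session
    return None  # pattern not found for this cell
```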

Together with the VCR and standard timeline control 1514, these two
levels of visual indexing allow the user to browse a meeting in a very
efficient way.

2.4.3 Image Viewing

As shown in FIG. 15, the thumbnails of the key frame images (e.g.,
1502) are listed in the key frame pane 1504. Selecting one of the
thumbnails 1502 with a mouse cursor or other input device brings the
corresponding key frame image to the main window 1506 at the left and
takes the application into the image viewing mode, where the user can
zoom in and out using the zoom control buttons 1522, read the text and
diagrams in the image, or cut and paste a portion of the image into
other documents. Additionally, the entire key frame can be cut and
pasted into other documents or printed as notes.

2.4.4 Whiteboard Content Visualization

Given the key frame images and the time stamp information, an image
that corresponds to the whiteboard content at any given time can be
reconstructed. If the image of every frame is rendered according to
the audio playback time using the timeline control 1514, the main
window plays back the whiteboard content like a movie. Using this
approach, the users have both the aural and visual context of the
session. But they cannot click any pen stroke that would take them
forward in time (future strokes 1508), because these strokes have not
yet been rendered in the main window.

In the initial implementation of the Whiteboard Capture System, the
future strokes were shown in a washed-out mode. However, after a short
trial period, the users of the browser often confused the future
strokes with strokes that were not cleanly erased. Another complaint
about the interface was that although the users liked the whiteboard
images without the person in front, they sometimes wanted to know who
wrote the strokes.

After a few design iterations, the following visualization process,
shown in FIG. 16, was decided on; it addresses all the aforementioned
concerns. The process actions of this process are as follows:

-   -   1. Render the current whiteboard content using the key frame
        image of the current chapter and the time stamp information
        (process action 1602).
    -   2. Render the future strokes, convert the result to grey
        scale, and blur it using a Gaussian filter (process action
        1604).
    -   3. Add the images from Step 1 and Step 2 (process action
        1606).
    -   4. Alpha-blend the image from Step 3 with the rectified image
        from the input sequence (process action 1608). The rectified
        image is the corresponding image from the input sequence (as
        shown in FIG. 4) but with the non-whiteboard region cropped,
        followed by a remapping to a rectangular shape. The user can
        control the alpha value with a GUI slider (1516 of FIG. 15)
        from 0, showing only the rendered key frame whiteboard image,
        to 1, showing exactly the original rectified image. The
        rendered key frame whiteboard image is the key frame image
        with the foreground object removed and replaced by the strokes
        that it occludes. A sketch of these rendering steps appears
        below.

This is believed to be a very helpful visualization because 1) both
present and future strokes are shown on the rendered whiteboard image,
allowing the user to jump both backward to the past and forward to the
future, and 2) blending the rectified input image with the key frame
adds the foreground object, thus giving more context. See FIG. 15 for
an example of such a visualization with alpha=0.8.
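A minimal sketch of these rendering steps using NumPy and SciPy; the
text only says the step-1 and step-2 images are added, so the
compositing details here are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# current_img: strokes written so far, rendered dark on white.
# future_img: strokes not yet written, rendered dark on white.
# rectified:  cropped/remapped camera frame; alpha: GUI slider value.

def render_playback_frame(current_img, future_img, rectified, alpha):
    # Step 2: ghost the future strokes -- grey scale, then Gaussian blur.
    grey = future_img.astype(float).mean(axis=2, keepdims=True)
    ghost = gaussian_filter(np.repeat(grey, 3, axis=2), sigma=(2, 2, 0))

    # Step 3: combine current content with the ghosted future strokes.
    # Dark strokes on a white board combine naturally by taking the
    # darker pixel (an assumption; the text just says "add").
    board = np.minimum(current_img.astype(float), ghost)

    # Step 4: alpha-blend with the rectified input image
    # (alpha=0 -> key frame only, alpha=1 -> camera image only).
    out = (1.0 - alpha) * board + alpha * rectified.astype(float)
    return out.astype(np.uint8)
```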

2.5 Security

Meeting participants are usually apprehensive about recording a
meeting because sensitive information might be viewed by unintended
people. For them, keeping the recorded data secure is a concern. To
address this concern, a simple token-based access security model was
developed. The process actions of this model are shown in FIG. 17.

In the Whiteboard Capture System, meeting participants are asked to
register with the capture software at the beginning of the meeting
recording (process action 1702). They can either fill in their email
aliases in a dialog box on the computer screen or, to speed up the
process, insert their corporate identification cards into a smart card
reader to register.

All the recorded sessions reside on a web server. If no one registers,
the meeting is posted on a publicly accessible web page (process
actions 1704, 1706). If at least one participant registers, an access
token is generated after the meeting recording and analysis (process
action 1708). The token is a long, randomly generated string
containing a unique meeting identifier. The URL containing the token
is emailed to the registered participants (process action 1710). The
recipients go to the URL to launch the web browsing software to review
the meeting (process action 1712). They can also forward the URL to
people who have not attended the meeting.
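By way of illustration, a minimal sketch of such token generation; the
exact token format, URL scheme, and server address are not specified
in the text and are assumed here:

```python
import secrets

# Build an unguessable per-meeting URL: a unique meeting identifier
# embedded in a long random string. The base URL is a placeholder.

def make_access_url(meeting_id, base="https://server.example/meetings"):
    token = f"{meeting_id}-{secrets.token_urlsafe(32)}"
    return f"{base}/{token}"
```

The security of this scheme rests entirely on the token being long and
random enough that the URL cannot be guessed.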

This simple security-by-obscurity model seems to work well. Other
security measures could, however, be employed.

In addition to the above-discussed security feature of the Whiteboard
Capture System, a privacy mode is also available while recording the
meeting. Should the meeting participants say or write something that
they do not wish to have recorded, a feature exists to erase the
previous 15 seconds (although another prescribed period of time could
be used) of both image and audio data. This erasure is initiated by
pressing either a physical or a GUI button.
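A minimal sketch of one way to implement this erasure, using a rolling
buffer that delays committing data to storage by the prescribed
window; the structure and names are assumptions, not the patent's
implementation:

```python
from collections import deque

class PrivacyBuffer:
    """Holds the most recent window of frames/audio so it can be erased."""

    def __init__(self, window_s=15.0):
        self.window_s = window_s
        self.items = deque()  # (timestamp_s, frame_or_audio_block)

    def add(self, timestamp_s, item):
        self.items.append((timestamp_s, item))

    def flush_older_than(self, now_s, archive):
        # Items older than the window are safe to commit to storage.
        while self.items and now_s - self.items[0][0] > self.window_s:
            archive.append(self.items.popleft()[1])

    def erase(self):
        # Button press: discard the buffered (most recent) window of data.
        self.items.clear()
```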

2.6 Alternate Embodiments

The above-described basic Whiteboard Capture System can be combined
with many other techniques and devices to render alternate
embodiments. The various embodiments discussed below can be used alone
or in combination.

In one such embodiment, conventional Optical Character Recognition
(OCR) is performed on the key frames to provide editable text that is
easily used to create documents or presentation viewgraphs.

In another embodiment, conventional voice recognition software is used
to convert the audio portion of the captured data to text. This allows
the easy creation of meeting minutes and other documents. It also
provides a relatively inexpensive way to provide meeting information
to the hearing impaired.

The Whiteboard Capture System can also be made portable by using, for
example, a notebook computer with a microphone and a camera on a
tripod. This configuration only requires an additional initial
calibration to determine the location of the camera relative to the
whiteboard. This calibration could be performed manually, by
determining the four corners of the panel in the image, or
automatically, by using such conventional methods as edge detection.

The analysis software of the Whiteboard Capture System can also be
used to determine key frames with whiteboard capture systems that use
pen tracking to infer whiteboard content. Since the history of the pen
coordinates is typically captured in vector form in these systems, the
content on the whiteboard at any given moment can be reconstructed
later. Using the Whiteboard Capture System analysis software with such
a system simplifies the analysis process. There is no determination of
whiteboard background color or whiteboard region rectification
necessary, no spatial and temporal filtering required, and the
classification of whiteboard cells is simpler because cell images will
either be stroke or whiteboard cells, since no foreground object will
interfere with the text written on the whiteboard. The cell "images"
are now derived from the content inferred by the pen locations over
the whiteboard area. This embodiment of the invention basically
clusters the cell "images" as discussed in FIG. 5, process action 506,
classifies each cell as a stroke or whiteboard cell similarly to
process action 508, except that there are no foreground cells, and
extracts the key frame images using the classification results
(process action 512). The results can be transmitted and archived with
low bandwidth and small storage requirements. Additionally, OCR can be
used to transcribe the captured key frames in this embodiment as well.

Additionally, in a working embodiment of the Whiteboard Capture
System, the frame rate of the system is limited by the frame rate of
the commercially available still cameras. To achieve a higher frame
rate, a high-resolution video camera, such as an HDTV camera, can be
used.

In yet another embodiment, the Whiteboard Capture System incorporates
gesture recognition to accept gesture commands. For instance, a
command box can be written somewhere on the whiteboard. When the user
motions or points to the box, the system uses gesture recognition to
time stamp the images at the particular time the gesture was made.

In the basic application, the analysis process assumes that the color
of the whiteboard background remains constant in an input sequence.
However, a known color patch can be installed above the top of the
whiteboard, where nobody can obscure it from the camera. The software
can then adjust the camera exposure parameters for different lighting
conditions on a per-frame basis, based on the known color
characteristics of this easily detectable patch. This works as
follows. If the exposure parameters stay constant, the color of the
patch will differ across the captured images under different lighting
conditions in the room. The camera can therefore adjust its exposure
parameters based on the observed color of the patch in the previous
frame, so that the color of the patch stays within a specified range,
and so, in turn, will the whiteboard region.
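A minimal sketch of such a per-frame adjustment; the control law,
gain, and exposure limits below are assumptions, since the text only
says the exposure is adjusted so that the patch color stays within a
specified range:

```python
# patch_luma is the measured mean luminance of the known patch in the
# previous frame; target_luma is the patch's known value under the
# reference exposure. All constants here are illustrative.

def adjust_exposure(exposure_s, patch_luma, target_luma=200.0,
                    gain=0.5, lo=1/1000, hi=1/8):
    # Brighten when the patch reads darker than its known value,
    # darken when it reads brighter; clamp to the camera's range.
    exposure_s *= 1.0 + gain * (target_luma - patch_luma) / target_luma
    return min(hi, max(lo, exposure_s))
```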

3.0 System Performance and Usage

3.1 Background.

The design goals of the Whiteboard Capture System were that it should
1) work with any existing whiteboard; 2) capture the whiteboard
content automatically and reliably; and 3) use the whiteboard content
as a visual index to efficiently browse a meeting recorded using the
system.

Compared to the whiteboard capture systems that use a sensing
mechanism or an electronic whiteboard, the Whiteboard Capture System
also had a set of unique technical challenges. Firstly, the whiteboard
background color is not typically pre-calibrated (for example, by
taking a picture of a blank whiteboard) because each room has several
light settings that could vary from session to session. Secondly,
people frequently move between the digital camera and the whiteboard,
and these foreground objects obscure some portion of the whiteboard
and cast shadows on it. Within a sequence, there may be no frame that
is totally un-obscured. These problems had to be dealt with in order
to compute time stamps and extract key frames.

3.2 System Components

During the design of the Whiteboard Capture System, prototype systems
were built and iteratively improved. Three conference rooms were
equipped with a Whiteboard Capture System. Information about these
three rooms is listed in Table 1 below. Sample images (80×80 pixels,
approximately 96 point font on the board) are shown in FIG. 18A (the
images correspond, from left to right, to Room 1, Room 2 and Room 3,
respectively).

TABLE 1. Information About Three Installation Sites

|                                     | Room 1     | Room 2      | Room 3      |
|-------------------------------------|------------|-------------|-------------|
| Board Dimension (feet)              | 4 × 3      | 8 × 5       | 12 × 5      |
| Key Frame Image Dimension (pixels)  | 1200 × 900 | 2400 × 1500 | 2400 × 1000 |
| Resolution (dpi)                    | 25         | 25          | 16.7        |

The sizes of the whiteboards in those rooms varied, and so did the
qualities of the key frame images produced. As can be seen from the
sample images (FIG. 18A), the writings on a 12′×5′ board (far right)
are fuzzier than the ones on the other two boards because the
resolution is maxed out for a 4 mega-pixel input image. Nevertheless,
they are still quite legible. Several selected frames from a session
using a 12′×5′ whiteboard (FIG. 18B) and the corresponding key frame
images (FIGS. 18C and 18D) are also shown.

Since the system is to work with any existing whiteboard, without the
need for special pens and erasers, a direct capture device, a still
camera, was chosen to capture the whiteboard content. In the exemplary
working embodiment of the Whiteboard Capture System, a Canon®
PowerShot G2 digital still camera with 4 mega pixels was used. This
camera provides images that are 2272 pixels by 1704
pixels—equivalent to 31.6 dpi for a 6′ by 4′ board. One important
reason this camera was chosen was the availability of a software
development kit that allows customized software solutions to be
written to control the camera from a PC. This software can specify
virtually all the camera parameters on a per-shot basis. Since the
system takes pictures of the whiteboard directly, there is no
mis-registration of the pen strokes. As long as the users turn on the
system before erasing, the content is preserved.

The analysis server runs on a Pentium III 800 MHz dual-CPU PC. The
analysis process takes about 20 minutes for every hour of session
time. The 16-bit, 11 kHz mono audio requires about 15 Mb of storage
per hour using MP3 encoding. The input image sequence requires about
34 Mb per hour using Motion JPEG compression.

The systems installed in the three conference rooms were used
frequently by various teams. Over the course of 6 months, 108 sessions
totaling 48 hours were recorded—averaging 27 minutes per session and
4.5 sessions per week. The average number of key frames per session
was 2.7. The key frame images were saved in JPEG format. The average
image size was 51.8 Kb, with sizes ranging from 17 Kb to 150 Kb.
Because JPEG compression works extremely well on the uniform white
background, the image size was related more to how much the users
wrote on the board than to the image dimensions.

All users of the system believed that the system is very useful for
meetings that use a whiteboard extensively. The key frame images and
the visual indexing capability not only allow the participants to
review a meeting at a later time, but also allow users who did not
attend the meeting to understand the gist of the meeting in a fraction
of the actual meeting time.

Some users found new ways to use the system that were not intended
initially. Take the example of status meetings, which usually did not
require writing on the whiteboard. People still turned on the
whiteboard capture system. When it was someone's turn to speak, the
manager wrote his/her name on the board so that the speech segments
could easily be found later in the recorded audio by clicking on the
names in the key frame image. Another example arose during
brainstorming sessions: when someone thought of a good idea, he wrote
a star on the side of the board and said the idea aloud. The audio
could then be retrieved later by clicking on the star.

The foregoing description of the invention has been presented for the
purposes of illustration and description. It is not intended to be
exhaustive or to limit the invention to the precise form disclosed.
Many modifications and variations are possible in light of the above
teaching. It is intended that the scope of the invention be limited
not by this detailed description, but rather by the claims appended
hereto.

Appendix: Plane-Based Whiteboard Color Estimation

Only one component of the color image is considered, but the technique
described below applies to all components (R, G, B, or Y). Each cell i
is defined by its image coordinates (x_(i), y_(i)). Its color is
designated by z_(i) (z=R, G, B, or Y). The color is computed as
described in Section 2.3.2, and is therefore noisy and even erroneous.
From experience with the meeting rooms, the color of the whiteboard
varies regularly. It is usually much brighter in the upper part and
becomes darker toward the lower part, or is much brighter in one of
the upper corners and becomes darker toward the opposite lower corner.
This is because the lights are installed against the ceiling.
Therefore, for a local region (e.g., 7×7 cells), the color can be fit
accurately by a plane; for the whole image, a plane fitting is still
very reasonable, and provides a robust indication of whether a cell
color is an outlier.

A plane can be represented by ax+by+c−z=0. A set of 3D points
{(x_(i), y_(i), z_(i)) | i=1, . . . , n}, with noise only in z_(i), is
given. The plane parameters p=[a,b,c]^(T) can be estimated by
minimizing the following objective function:
$F = \sum\limits_{i} f_{i}^{2},$
where f_(i)=ax_(i)+by_(i)+c−z_(i). The least-squares solution is given
by $p = (A^{T}A)^{-1}A^{T}z$, where
$A = \begin{bmatrix} x_{1} & y_{1} & 1 \\ \cdots & \cdots & \cdots \\ x_{n} & y_{n} & 1 \end{bmatrix} \quad \text{and} \quad z = \left\lbrack z_{1}, \ldots, z_{n} \right\rbrack^{T}.$
Once the plane parameters are determined, the color of cell i is
replaced by ẑ_(i)=ax_(i)+by_(i)+c.
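This least-squares fit translates directly into a few lines of NumPy
(a sketch; variable names follow the text):

```python
import numpy as np

def fit_plane(xs, ys, zs):
    # Solve p = (A^T A)^{-1} A^T z in the least-squares sense, where
    # each row of A is [x_i, y_i, 1] and z stacks the cell colors.
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    p, *_ = np.linalg.lstsq(A, zs, rcond=None)
    return p  # (a, b, c); fitted color at cell i is a*x_i + b*y_i + c
```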

The least-squares technique is not robust to erroneous data
(outliers). As mentioned earlier, the whiteboard color initially
computed does contain outliers. In order to detect and reject
outliers, a robust technique is used to fit a plane to the whole
whiteboard image: the least-median-squares [11], a very robust
technique that is able to tolerate nearly half of the data being
outliers. The idea is to estimate the parameters by minimizing the
median, rather than the sum, of the squared errors, i.e.,
$\min\limits_{p} \; \underset{i}{\operatorname{median}} \; f_{i}^{2}.$
First, m random sub-samples of 3 points are drawn (3 is the minimum
number of points that define a plane). Each sub-sample gives an
estimate of the plane. The number m should be large enough that the
probability that at least one of the m sub-samples is good is close to
1, say 99%. If it is assumed that half of the data could be outliers,
then m=35, so the random sampling can be done very efficiently. For
each sub-sample, the plane parameters and the median of the squared
errors f_(i)² are computed. The plane parameters that give the minimum
median of the squared errors are retained, and that minimum median is
denoted by M. The so-called robust standard deviation σ = 1.4826√M is
then computed (the coefficient is chosen to achieve the same
efficiency as when no outliers are present). A point i is considered
to be an outlier and discarded if its error |f_(i)| > 2.5σ. Finally, a
plane is fit to the good points using the least-squares technique
described earlier. The color of an outlier cell i is replaced by
ẑ_(i)=ax_(i)+by_(i)+c.
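A minimal NumPy sketch of this least-median-of-squares procedure,
following the steps above (the sample count m=35 and the 2.5σ
rejection threshold are those given in the text):

```python
import numpy as np

def fit_plane_lmeds(xs, ys, zs, m=35, rng=None):
    rng = rng or np.random.default_rng()
    pts = np.column_stack([xs, ys, np.ones(len(xs))])

    # Draw m sub-samples of 3 points; keep the plane whose median
    # squared residual is smallest.
    best_p, best_med = None, np.inf
    for _ in range(m):
        idx = rng.choice(len(xs), 3, replace=False)
        try:
            p = np.linalg.solve(pts[idx], zs[idx])  # exact plane through 3 points
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        med = np.median((pts @ p - zs) ** 2)
        if med < best_med:
            best_p, best_med = p, med

    # Robust standard deviation from the minimum median M, then reject
    # points with |f_i| > 2.5 sigma and refit by least squares.
    sigma = 1.4826 * np.sqrt(best_med)
    inliers = np.abs(pts @ best_p - zs) <= 2.5 * sigma
    p, *_ = np.linalg.lstsq(pts[inliers], zs[inliers], rcond=None)
    return p
```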

1-59. (canceled)
60. A system for capturing the audio and video content of a meeting,
comprising: a capture system that captures a sequence of images of
data written on a non-electronic whiteboard and audio signals
corresponding to sounds that occur during a meeting; an analysis
server for analyzing the sequence of images that extracts key data
frames written on the whiteboard and correlates the audio signals to
the key data frames; and a browsing module for viewing the analyzed
meeting key data frames and correlated audio.

61. The system of claim 60 wherein the capture system further
comprises: a camera positioned to capture the sequence of images; a
microphone to record the audio signals; and a computer for recording
the sequence of images and the audio signals.

62. The system of claim 61 wherein the camera is at least one of: a
still camera; and a video camera.

63. The system of claim 61 wherein the camera is zoomed in as close to
the whiteboard as possible to maximize resolution.

64. The system of claim 61 wherein the camera is aligned as parallel
to the whiteboard as possible to minimize scene depth.

65. The system of claim 60 wherein the analysis server identifies the
key data frames by: rectifying a view of the whiteboard in every image
in the sequence of images; extracting whiteboard background color;
dividing each image of the sequence of images into cells of cell
images; clustering cell images that are similar throughout the
sequence of images for each cell over time; classifying each cell
image as a stroke, a foreground object or a whiteboard cell; and
extracting key frame images using the classification results.

66. The system of claim 60 wherein the analysis server identifies the
key data frames by: rectifying a view of the whiteboard in every image
in the sequence of images; extracting whiteboard background color;
clustering pixels that are similar throughout the sequence of images
for each cell over time; classifying each pixel as a stroke, a
foreground object or a whiteboard cell; and extracting key frame
images using the classification results.

67. The system of claim 60 wherein one or more users register a user
identifier at the capture unit before recording starts; if at least
one user registers at the capture unit, the analysis server generates
an access token after event recording and analysis; the access token
and computer memory location of analyzed meeting data is provided to
the registered user identifiers; and the one or more users access the
computer memory location of the analyzed event data to review the
analyzed meeting data.

68. The system of claim 67 wherein the user identifier is an email
address.

69. The system of claim 67 wherein the computer memory location of the
analyzed event data is an address of an Internet web site.

70. The system of claim 60 further comprising a privacy feature in
said capture unit that allows a user to erase at least one of:
portions of the sequence of images; and portions of the audio.

71. The system of claim 70 wherein said privacy feature is activated
by pressing either a graphical user interface button or a physical
button.

72. The system of claim 60 wherein the capture system is portable.

73-78. (canceled)

79. A system for distilling the content of a meeting comprising: a
capture system that captures a sequence of data written on a
non-electronic whiteboard, said capture system tracking pen location
to infer content written on a whiteboard and recording audio signals
correlating to said content written on a whiteboard; and an analysis
server for analyzing the sequence of images that extracts key data
frames written on the whiteboard and correlates the audio signals to
the key data frames.

80. The system of claim 79 wherein said analysis server performs the
following actions: dividing each region of the whiteboard into cells;
clustering cells that are the same throughout the sequence of data
written for each cell over time; classifying each cell as a stroke or
a whiteboard cell; and extracting key frame images using the
classification results.

81. The system of claim 80 wherein the whiteboard cells are divided
into cells that are approximately the size of one written character.

82. (canceled)

83. A process for summarizing and indexing audiovisual content,
comprising the following process actions: capturing a sequence of
images of content written on a non-electronic whiteboard with a
camera; recording audio signals correlated with the sequence of
images; and analyzing the sequence of images to isolate key frames
summarizing key points of said whiteboard content.

84. The process of claim 83 further comprising correlating said audio
recordings with said key frames.

85. The process of claim 84 wherein said audio signals are correlated
with said sequence of images by time stamps associated with both the
recorded audio and the sequence of images.

86. The process of claim 85 wherein correlating the audio signals with
said sequence of images comprises the process actions of: time
stamping said sequence of images with a common clock at the time the
images are captured; time stamping said audio signals with the common
clock at the time the audio signals are recorded; and correlating the
sequence of images and audio signals using the time stamps of the
common clock.

87. The process of claim 85 further comprising accessing said sequence
of images and said correlated audio signals at a desired point in said
sequence of images.

88. The process of claim 87 wherein said key frames are used to select
said desired point in said sequence.